Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation
The ability to collect a large dataset of human preferences from
text-to-image users is usually limited to companies, making such datasets
inaccessible to the public. To address this issue, we create a web app that
enables text-to-image users to generate images and specify their preferences.
Using this web app we build Pick-a-Pic, a large, open dataset of text-to-image
prompts and real users' preferences over generated images. We leverage this
dataset to train a CLIP-based scoring function, PickScore, which exhibits
superhuman performance on the task of predicting human preferences. Then, we
test PickScore's ability to perform model evaluation and observe that it
correlates better with human rankings than other automatic evaluation metrics.
Therefore, we recommend using PickScore for evaluating future text-to-image
generation models, and using Pick-a-Pic prompts as a more relevant dataset than
MS-COCO. Finally, we demonstrate how PickScore can enhance existing
text-to-image models via ranking.
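As a rough illustration of how a CLIP-based preference score such as PickScore could be used to rank candidate generations, the sketch below scores several images against a prompt and keeps the best one. This is a minimal sketch assuming the standard Hugging Face transformers CLIP interface; the model identifier is a placeholder, not the paper's exact release.

```python
# Hedged sketch: ranking candidate images for a prompt with a CLIP-style
# preference model. The model id "pickscore-model" is a hypothetical placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("pickscore-model")   # hypothetical id
model = AutoModel.from_pretrained("pickscore-model").eval()    # hypothetical id

def rank_images(prompt: str, image_paths: list[str]) -> list[tuple[str, float]]:
    """Return (path, score) pairs sorted from most to least preferred."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[prompt], images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    # Cosine similarity between the prompt and each candidate image.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    scores = (text_emb @ image_emb.T).squeeze(0)
    return sorted(zip(image_paths, scores.tolist()),
                  key=lambda pair: pair[1], reverse=True)

# Usage: pick the highest-scoring sample out of several generations.
# best_path, best_score = rank_images("a cat wearing a top hat",
#                                     ["gen_0.png", "gen_1.png", "gen_2.png"])[0]
```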
Audio Language Modeling using Perceptually-Guided Discrete Representations
In this work, we study the task of Audio Language Modeling, in which we aim
to learn probabilistic models for audio that can be used for generation and
completion. We use a state-of-the-art perceptually-guided audio compression
model to encode audio to discrete representations. Next, we train a
transformer-based causal language model using these representations. At
inference time, we perform audio auto-completion by encoding an audio prompt as
a discrete sequence, feeding it to the audio language model, sampling from the
model, and synthesizing the corresponding time-domain signal. We evaluate the
quality of samples generated by our method on Audioset, the largest dataset for
general audio to date, and show that it is superior to the evaluated baseline
audio encoders. We additionally provide an extensive analysis to better
understand the trade-off between audio-quality and language-modeling
capabilities. Samples: link
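The auto-completion procedure described above can be pictured as a simple loop: encode the audio prompt into discrete codec tokens, sample a continuation token by token from the causal language model, and decode the full sequence back to a waveform. The `codec` and `lm` interfaces below are hypothetical placeholders used only to show the structure, not the paper's released components.

```python
# Hedged sketch of audio auto-completion with a discrete codec + causal LM.
# `codec.encode`, `codec.decode`, and `lm` are hypothetical interfaces.
import torch

@torch.no_grad()
def autocomplete(codec, lm, prompt_wave: torch.Tensor,
                 n_new_tokens: int = 500, temperature: float = 1.0) -> torch.Tensor:
    """Continue an audio prompt by sampling discrete tokens from a causal LM."""
    tokens = codec.encode(prompt_wave)              # (1, T) discrete code indices
    for _ in range(n_new_tokens):
        logits = lm(tokens)[:, -1, :]               # next-token distribution
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return codec.decode(tokens)                     # back to a time-domain signal
```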
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
We propose using self-supervised discrete representations for the task of
speech resynthesis. To generate disentangled representation, we separately
extract low-bitrate representations for speech content, prosodic information,
and speaker identity. This allows us to synthesize speech in a controllable
manner. We analyze various state-of-the-art, self-supervised representation
learning methods and shed light on the advantages of each method while
considering reconstruction quality and disentanglement properties.
Specifically, we evaluate the F0 reconstruction, speaker identification
performance (for both resynthesis and voice conversion), recordings'
intelligibility, and overall quality using subjective human evaluation. Lastly,
we demonstrate how these representations can be used for an ultra-lightweight
speech codec. Using the obtained representations, we can get to a rate of 365
bits per second while providing better speech quality than the baseline
methods. Audio samples can be found under the following link:
speechbot.github.io/resynthesis. Comment: In Proceedings of Interspeech 2021.
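The decomposition described above can be pictured as three parallel encoders feeding a single vocoder; swapping the speaker embedding while keeping content and prosody fixed yields voice conversion. The encoder and vocoder interfaces in this sketch are hypothetical placeholders illustrating the structure, not the paper's actual modules.

```python
# Hedged sketch of disentangled speech resynthesis.
# content_encoder, f0_encoder, speaker_encoder, and vocoder are hypothetical modules.
from typing import Optional
import torch

@torch.no_grad()
def resynthesize(content_encoder, f0_encoder, speaker_encoder, vocoder,
                 source_wave: torch.Tensor,
                 target_speaker_wave: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Rebuild speech from discrete content units, prosody, and a speaker embedding.

    Passing a different target_speaker_wave performs voice conversion: content
    and prosody come from the source, identity from the target.
    """
    content_units = content_encoder(source_wave)         # low-bitrate discrete units
    f0_units = f0_encoder(source_wave)                    # quantized pitch contour
    spk_wave = target_speaker_wave if target_speaker_wave is not None else source_wave
    speaker_emb = speaker_encoder(spk_wave)               # fixed-size identity vector
    return vocoder(content_units, f0_units, speaker_emb)  # time-domain speech
```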
AudioGen: Textually Guided Audio Generation
We tackle the problem of generating audio samples conditioned on descriptive
text captions. In this work, we propose AudioGen, an auto-regressive
generative model that generates audio samples conditioned on text inputs.
AudioGen operates on a learnt discrete audio representation. The task of
text-to-audio generation poses multiple challenges. Due to the way audio
travels through a medium, differentiating "objects" can be a difficult task
(e.g., separating multiple people simultaneously speaking). This is further
complicated by real-world recording conditions (e.g., background noise,
reverberation, etc.). Scarce text annotations impose another constraint,
limiting the ability to scale models. Finally, modeling high-fidelity audio
requires encoding audio at high sampling rate, leading to extremely long
sequences. To alleviate the aforementioned challenges, we propose an
augmentation technique that mixes different audio samples, driving the model to
internally learn to separate multiple sources. We curated 10 datasets
containing different types of audio and text annotations to handle the scarcity
of text-audio data points. For faster inference, we explore the use of
multi-stream modeling, allowing the use of shorter sequences while maintaining
a similar bitrate and perceptual quality. We apply classifier-free guidance to
improve adherence to text. Compared to the evaluated baselines, AudioGen
performs better on both objective and subjective metrics. Finally, we explore the
ability of the proposed method to generate audio continuations conditionally and
unconditionally. Samples: https://tinyurl.com/audiogen-text2audi
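Classifier-free guidance, mentioned above for improving adherence to the text, can be illustrated with the usual two-pass logit combination: the model is run once with the text conditioning and once without, and the conditional logits are pushed away from the unconditional ones. The `lm` interface and the guidance scale value below are assumptions for illustration, not AudioGen's exact implementation.

```python
# Hedged sketch of classifier-free guidance for sampling audio tokens.
# `lm(tokens, text_emb)` returning logits is a hypothetical interface.
import torch

@torch.no_grad()
def guided_next_token(lm, tokens: torch.Tensor, text_emb: torch.Tensor,
                      guidance_scale: float = 3.0) -> torch.Tensor:
    """Sample one audio token, steering generation toward the text condition."""
    cond_logits = lm(tokens, text_emb)[:, -1, :]      # conditioned on the caption
    uncond_logits = lm(tokens, None)[:, -1, :]        # unconditional pass
    # Push the conditional prediction away from the unconditional one.
    logits = uncond_logits + guidance_scale * (cond_logits - uncond_logits)
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```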